
Conversation

@checkupup
Contributor

At present, the dax instance is initialized and its inner buffer allocated on module_init(), and released on module_free(). The memory requirement for the inner buffer per instance is quite large.

The existing implementation has 2 pipelines containing dax. Even though the host is restricted from processing a stream on both pipelines simultaneously, they may coexist under some circumstances, e.g. the interim when switching the PCM device from one to the other, which may exhaust memory.

This commit changes the timing of instance allocation/deallocation to module_prepare()/reset() respectively. That is, the dax instance only occupies the inner buffer memory while processing.
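
A minimal sketch of the lifecycle described above, using the SOF module adapter callbacks. create_instance() is assumed here as the counterpart of the destroy_instance() helper quoted later in this review, the prepare() signature follows the current module adapter interface as I understand it, and error handling is trimmed:

#include <sof/audio/module_adapter/module/generic.h>

/* assumed helpers that allocate/free the large DAX inner buffer */
int create_instance(struct processing_module *mod);
void destroy_instance(struct processing_module *mod);

static int sof_dax_prepare(struct processing_module *mod,
			   struct sof_source **sources, int num_of_sources,
			   struct sof_sink **sinks, int num_of_sinks)
{
	/* the inner buffer is only needed while a stream is active */
	return create_instance(mod);
}

static int sof_dax_reset(struct processing_module *mod)
{
	/* give the memory back as soon as the stream is torn down */
	destroy_instance(mod);
	return 0;
}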

@checkupup
Contributor Author

PR that references #10503

@sofci
Collaborator

sofci commented Feb 6, 2026

Can one of the admins verify this patch?

Reply "test this please" to run this test once.

Contributor

@johnylin76 left a comment


LGTM overall. Please help test this.

struct sof_dax *dax_ctx = module_get_private_data(mod);

/* dax instance will be established on prepare(), and destroyed on reset() */
destroy_instance(mod);
Contributor


Maybe also add an if (dax_ctx) check to protect these lines,

although I'm not sure whether this path can be reached. We can always be safe.
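
A minimal sketch of the suggested guard, reusing the reset handler lines from the hunk above; only the if (dax_ctx) check is new:

struct sof_dax *dax_ctx = module_get_private_data(mod);

/* dax instance will be established on prepare(), and destroyed on reset() */
if (dax_ctx)
	destroy_instance(mod);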

@lgirdwood
Member

@checkupup @johnylin76 I think we need to consider that DSPs are quite memory-constrained devices and the asynchronous nature of PCM/pipeline creation/destruction within ALSA can easily leave us with some heap fragmentation, especially when we need to allocate very large contiguous buffers (we had this problem with the key phrase buffer).

e.g. please consider

  1. User plays audio to endpoint 1 with DAX instance 1
  2. User selects endpoint 2 with DAX instance 2; DAX 1 then frees memory.
  3. An asynchronous audio event occurs before the DAX 2 PCM is ready, e.g. an incoming call or system notification, and allocates memory resources for its pipeline.
  4. DAX instance 2 attempts to allocate a large contiguous buffer and fails because event 3 fragmented the heap.

If we can allocate the DAX memory resources at init() (by the 1st DAX instance) and then pass/share the resources between DAX instances, then we don't have to worry about heap fragmentation and other audio events.

@checkupup
Contributor Author

@checkupup @johnylin76 I think we need to consider that DSPs are quite memory-constrained devices and the asynchronous nature of PCM/pipeline creation/destruction within ALSA can easily leave us with some heap fragmentation, especially when we need to allocate very large contiguous buffers (we had this problem with the key phrase buffer).

e.g. please consider

  1. User plays audio to endpoint 1 with DAX instance 1
  2. User selects endpoint 2 with DAX instance 2; DAX 1 then frees memory.
  3. An asynchronous audio event occurs before the DAX 2 PCM is ready, e.g. an incoming call or system notification, and allocates memory resources for its pipeline.
  4. DAX instance 2 attempts to allocate a large contiguous buffer and fails because event 3 fragmented the heap.

If we can allocate the DAX memory resources at init() (by the 1st DAX instance) and then pass/share the resources between DAX instances, then we don't have to worry about heap fragmentation and other audio events.

Hi @johnylin76 @lgirdwood. If so, I would suggest using the BSS section for the dax memories, similar to the implementation in src/audio/google/google_rtc_audio_processing.c:

static __aligned(PLATFORM_DCACHE_ALIGN)
uint8_t aec_mem_blob[CONFIG_COMP_GOOGLE_RTC_AUDIO_PROCESSING_MEMORY_BUFFER_SIZE_KB * 1024];

This:

  1. Avoids memory fragmentation
  2. Avoids exhausting memory when switching the PCM device from one to the other; see dax: set instance lifespan from prepare to reset #10503

However, there are some known issues:

  1. How do we ensure that multiple DAX wrapper modules can exist at the same time while only one DAX instance processes the stream? The 'set instance lifespan from prepare to reset' scheme also needs this criterion. The difference is that the former causes undefined behavior while the latter reports a memory allocation failure.
  2. BSS consumes real physical pages whereas virtual memory does not. I'm afraid this will consume more actual physical memory.

@johnylin76
Contributor

If we can allocate the DAX memory resources at init() (by the 1st DAX instance) and then pass/share the resources between DAX instances, then we don't have to worry about heap fragmentation and other audio events.

Hi @lgirdwood,

Could you elaborate on this? I don't think I have seen an example of a module/codec that shares resources between instances. Thanks.

@johnylin76
Contributor

Hi @johnylin76 @lgirdwood. If so, I would suggest using the BSS section for the dax memories, similar to the implementation in src/audio/google/google_rtc_audio_processing.c:

static __aligned(PLATFORM_DCACHE_ALIGN)
uint8_t aec_mem_blob[CONFIG_COMP_GOOGLE_RTC_AUDIO_PROCESSING_MEMORY_BUFFER_SIZE_KB * 1024];

I'm not sure BSS is a suitable solution; I'm afraid it may introduce other issues. For example, the memory may be directly reflected in the built image (*.ri) size and slow down the DSP load time.

How do we ensure that multiple DAX wrapper modules can exist at the same time while only one DAX instance processes the stream? The 'set instance lifespan from prepare to reset' scheme also needs this criterion. The difference is that the former causes undefined behavior while the latter reports a memory allocation failure.

That is indeed tricky. On the SOF firmware side there is actually no such constraint. What we do so far is enforce mutually exclusive usage of the ALSA PCM devices from the system side.

@johnylin76
Contributor

Hi @checkupup @lgirdwood

Since we still have many ideas for optimizing the design, can we settle on the current implementation first? That way we can have a workable DAX-supported firmware solution shortly to unblock the ongoing project, while keeping the optimization discussion in a newly created issue tracker.

Thanks.

@lgirdwood
Member

@checkupup @johnylin76 I think we need to consider that DSPs are quite memory-constrained devices and the asynchronous nature of PCM/pipeline creation/destruction within ALSA can easily leave us with some heap fragmentation, especially when we need to allocate very large contiguous buffers (we had this problem with the key phrase buffer).
e.g. please consider

  1. User plays audio to endpoint 1 with DAX instance 1
  2. User selects endpoint 2 with DAX instance 2; DAX 1 then frees memory.
  3. An asynchronous audio event occurs before the DAX 2 PCM is ready, e.g. an incoming call or system notification, and allocates memory resources for its pipeline.
  4. DAX instance 2 attempts to allocate a large contiguous buffer and fails because event 3 fragmented the heap.

If we can allocate the DAX memory resources at init() (by the 1st DAX instance) and then pass/share the resources between DAX instances, then we don't have to worry about heap fragmentation and other audio events.

Hi @johnylin76 @lgirdwood. If so, I would suggest using the BSS section for the dax memories, similar to the implementation in src/audio/google/google_rtc_audio_processing.c:

static __aligned(PLATFORM_DCACHE_ALIGN)
uint8_t aec_mem_blob[CONFIG_COMP_GOOGLE_RTC_AUDIO_PROCESSING_MEMORY_BUFFER_SIZE_KB * 1024];

This:

  1. Avoids memory fragmentation
  2. Avoids exhausting memory when switching the PCM device from one to the other; see dax: set instance lifespan from prepare to reset #10503

However, there are some known issues:

  1. How do we ensure that multiple DAX wrapper modules can exist at the same time while only one DAX instance processes the stream? The 'set instance lifespan from prepare to reset' scheme also needs this criterion. The difference is that the former causes undefined behavior while the latter reports a memory allocation failure.

The expectation is that a module ELF section should behave the same whether built-in or as an llext module, i.e. if the above BSS solution causes an error we should log/fix it.

  1. BSS consumes real physical pages whereas virtual memory does not. I'm afraid this will consume more actual physical memory.

On Intel HW both solutions consume PHY pages, so it's not a loss here.

@lgirdwood
Member

If we can allocate the DAX memory resources at init() (by the 1st DAX instance) and then pass/share the resources between DAX instances, then we don't have to worry about heap fragmentation and other audio events.

Hi @lgirdwood,

Could you elaborate on this? I don't think I have seen an example of a module/codec that shares resources between instances. Thanks.

I was thinking here we could have a static context pointer in BSS: if NULL, DAX would allocate it, and if not NULL, DAX would use it during init(). That would give us a shared resource with mutual exclusion if we also had an atomic_t int around it for access.

I agree with the other comments; this is quite a corner case, given we have a high memory requirement on a device with limited memory.
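
A hedged sketch of that shared-context idea, assuming Zephyr's atomic API from <zephyr/sys/atomic.h>; the names dax_shared_ctx, dax_shared_busy and dax_alloc_inner_buffer() are hypothetical placeholders, not the actual DAX code:

#include <zephyr/sys/atomic.h>
#include <stddef.h>

/* placeholder for whatever allocator the module ends up using */
void *dax_alloc_inner_buffer(void);

static void *dax_shared_ctx;		/* shared inner buffer, pointer lives in BSS */
static atomic_t dax_shared_busy;	/* set while an instance owns the buffer */

static void *dax_claim_shared_ctx(void)
{
	/* mutual exclusion: only one instance may use the buffer at a time */
	if (!atomic_cas(&dax_shared_busy, 0, 1))
		return NULL;

	/* first claimer allocates; later claimers reuse the same memory */
	if (!dax_shared_ctx)
		dax_shared_ctx = dax_alloc_inner_buffer();

	return dax_shared_ctx;
}

static void dax_release_shared_ctx(void)
{
	atomic_set(&dax_shared_busy, 0);
}

The buffer is deliberately never freed here, so later instances reuse the same allocation instead of re-allocating into a possibly fragmented heap.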

@lgirdwood
Member

Hi @checkupup @lgirdwood

Since we still have many ideas for optimizing the design, can we settle on the current implementation first? That way we can have a workable DAX-supported firmware solution shortly to unblock the ongoing project, while keeping the optimization discussion in a newly created issue tracker.

Thanks.

Ack - let's unblock the project and use the existing PR. FWIW we have a lot of memory updates going in around userspace LL, and this is a good example case that we want to support.

Member

@lgirdwood left a comment


LGTM, just some open items from @johnylin76

@checkupup
Contributor Author

Hi @johnylin76 @lgirdwood, I have verified this change and found no issues. An additional mechanism was added to delay the reset and free, because I found that when sof_dax_reset executes, dax_process might still be processing. Please help review it, thanks.

@johnylin76
Contributor

Hi @johnylin76 @lgirdwood, I have verified this change and found no issues. An additional mechanism was added to delay the reset and free, because I found that when sof_dax_reset executes, dax_process might still be processing. Please help review it, thanks.

That is what I overlooked, and it is definitely bad. For LL modules that could never happen, but since DAX runs as a DP component, the process function is wrapped by ring buffers (LL) and runs on another thread. As a result, sof_dax_process and sof_dax_reset run asynchronously.

Your latest change looks logically correct, but I am worried about the race condition. Perhaps the process/reset signal could be implemented with an atomic_t variable in dax_ctx?

@checkupup
Contributor Author

checkupup commented Feb 10, 2026

Well, I guess sof_dax_set_configuration and sof_dax_process also run asynchronously. If so, I could submit another PR to fix the synchronization problem using a spinlock or atomic_t. What do you think? @johnylin76 @lgirdwood

return -EINVAL;
}

dax_ctx->update_flags |= DAX_PROCESSING_MASK;
Contributor


I would suggest a new atomic variable in dax_ctx, e.g. atomic_t proc_flags.

#include <rtos/atomic.h>

static void set_dax_proc_flag(..., int32_t flag)
{
  int32_t proc_flags = atomic_read(&dax_ctx->proc_flags);
  atomic_set(&dax_ctx->proc_flags, proc_flags | flag);
}
static bool get_dax_proc_flag(..., int32_t flag)
{
  int32_t proc_flags = atomic_read(&dax_ctx->proc_flags);
  return !!(proc_flags & flag);
}

Contributor


This is so far the best approach in my mind, although it is not ideal. (Ideally the lock would be held by process for the whole function.)

But it should be fair enough, given that process calls take much longer than reset.

Member


ack - DP/LL are different threads. The DP thread is always preempted by the synchronous LL thread every 1 ms, so this solution looks fine for protecting shared module resources.

Contributor


The DP thread is always preempted by the synchronous LL thread every 1 ms.

Oh, that is critical. So basically we can assume the variable accesses on the two threads won't conflict. Then I assume the lock/atomic can be deferred.

Member


I would still add the atomic flag, as we could race when tearing down the old DAX DP thread whilst creating the new DAX DP thread during the transition. LL would interrupt this DP transition too if other audio is being processed, and we also have Linux userspace opening/closing ALSA PCMs (which will be subject to user scheduling).

Contributor Author


I agree. The probability is extremely small because the DP thread is always preempted by the synchronous LL thread every 1 ms, but that doesn't mean it's absolutely impossible. I will submit another PR to protect dax_ctx->update_flags.

source_release_data(source, consumed_bytes);

check_and_update_settings(mod);
if (dax_ctx->enable && !dax_ctx->p_dax) {
Contributor


Would this case happen?
Is dax_init a non-blocking function?

Contributor Author


sequenceDiagram
    participant DP as DP Thread
    participant LL as LL Thread

    DP->>DP: Running sof_dax_process
    activate DP
    LL->>LL: Starts sof_dax_reset
    activate LL
    LL->>LL: sof_dax_reset is done
    deactivate LL
    DP->>DP: Calls check_and_update_state
    activate DP
    DP->>DP: check_and_update_state returns
    deactivate DP
    DP->>DP: sof_dax_process is done
    deactivate DP
    DP->>DP: sof_dax_process starts again
    activate DP

See the diagram above; I'm worried that sof_dax_process will run another iteration before being stopped.

Contributor


I think that should be unreachable.

During reset, module_adapter takes care of schedule_task_cancel https://github.com/thesofproject/sof/blob/main/src/audio/module_adapter/module/generic.c#L642
which dispatches to zephyr_dp_schedule in our case. The task is removed from the schedule list once reset is done, which means sof_dax_process won't be scheduled after that.

Having said that, it does no harm to add the safety check. So LGTM.

Contributor Author


I will remove this check.

@johnylin76
Contributor

Well, I guess sof_dax_set_configuration and sof_dax_process also run asynchronously. If so, I could submit another PR to fix the synchronization problem using a spinlock or atomic_t. What do you think? @johnylin76 @lgirdwood

Yes. Given that the module runs in DP mode, once the process task is scheduled we should not modify from LL a bit that will be read/written during process. So yes, the potential problem is bigger than we imagined.

In most cases we are fine, except that with the new change we release the data buffer on reset. In module_adapter_reset, the DP scheduler is cancelled while the running event (if there is one) is allowed to finish. That is, reset is not blocked on process, which may result in the buffer being released while still under processing.

I do agree the implementation with lock/atomic can be deferred as long as the buffer release issue above is taken care of.
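
A hedged sketch of the "delay reset and free" idea discussed here: reset only marks the instance for release if the DP process is still in flight, and process performs the actual free when it returns. DAX_RELEASE_PENDING_MASK is an illustrative flag name, and these plain flag accesses would themselves still need the atomic/lock treatment discussed below:

static int sof_dax_reset(struct processing_module *mod)
{
	struct sof_dax *dax_ctx = module_get_private_data(mod);

	if (dax_ctx->update_flags & DAX_PROCESSING_MASK) {
		/* DP thread is still inside process(): defer the free */
		dax_ctx->update_flags |= DAX_RELEASE_PENDING_MASK;
		return 0;
	}

	destroy_instance(mod);
	return 0;
}

/* at the tail of sof_dax_process() */
	dax_ctx->update_flags &= ~DAX_PROCESSING_MASK;
	if (dax_ctx->update_flags & DAX_RELEASE_PENDING_MASK)
		destroy_instance(mod);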


At present, the dax instance is initialized and its inner buffer
allocated on module_init(), and released on module_free(). The
memory requirement for the inner buffer per instance is quite large.

The existing implementation has 2 pipelines containing dax. Even
though the host is restricted from processing a stream on both
pipelines simultaneously, they may coexist under some circumstances,
e.g. the interim when switching the PCM device from one to the
other, which may exhaust memory.

This commit changes the timing of instance allocation/deallocation to
module_prepare()/reset() respectively. That is, the dax instance only
occupies the inner buffer memory while processing.

Reported-by: Johny Lin <johnylin@google.com>
Signed-off-by: Jun Lai <jun.lai@dolby.com>
In the DP domain, sof_dax_process runs in a DP thread,
while the other sof_dax_* functions run in a different
LL thread. Use atomic flags instead of the original
dax_ctx->update_flags to avoid potential race
conditions when updating the flags.

Since dax_inf.h should not have any dependencies
on SOF, the atomic flags are defined in dax.c

Signed-off-by: Jun Lai <jun.lai@dolby.com>
@checkupup
Contributor Author

Hi @johnylin76 @lgirdwood, I appended another commit to this PR that implements the atomic flags; please help review, thanks.

return atomic_read(flags) & flag;
case DAX_FLAG_ADD:
if ((atomic_read(flags) & flag) == 0)
atomic_add(flags, flag);
Collaborator


I don't think this provides the synchronisation that you're trying to achieve here. The thread can be preempted between the two lines above. In general it's very difficult to implement synchronisation with atomic variables alone; I think you really need a proper lock if you really need to protect against this race. Also note that yesterday a PR got merged that moves all DP processing in SOF to userspace. With that, your DAX DP thread will run in userspace too, and your .reset() method will now be called in the same thread context, synchronously with processing. So, as of today I don't think this race can appear, but please re-test.
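
To illustrate the preemption window: a hedged sketch assuming Zephyr's atomic API from <zephyr/sys/atomic.h>. A single read-modify-write primitive such as atomic_or() removes the gap between the test and the update, although, as noted above, atomics alone still don't give full synchronisation between process and reset:

#include <zephyr/sys/atomic.h>
#include <stdbool.h>

static void dax_flag_add(atomic_t *flags, atomic_val_t flag)
{
	/* atomically ORs the bit in; idempotent, so no prior test is needed */
	atomic_or(flags, flag);
}

static void dax_flag_clear(atomic_t *flags, atomic_val_t flag)
{
	atomic_and(flags, ~flag);
}

static bool dax_flag_test(atomic_t *flags, atomic_val_t flag)
{
	return (atomic_get(flags) & flag) != 0;
}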
